At the core of every ML problem lies the data. Data taken directly from the real world is often messy and untidy, and needs extensive preparation before we can even begin to use it. This assignment simulates the process of data cleaning and preparation on an example real-world task: the NFLRush dataset. This dataset is from a completed competition on Kaggle. You are discouraged from looking up the competition page, and it will be considered plagiarism if you directly copy code from public notebooks.
Here are a couple of noteworthy links to documentation/guides:
These will be immensely helpful throughout this assignment. If you have any questions, feel free to reach out on the discussion boards.
A quick note before you begin: this isn't supposed to be tedious! The intended solution runs in roughly 10 seconds depending on your machine, while less-ideal solutions can take many minutes or even hours. However, for this assignment, focus on getting the data right and don't stress about runtime; it's OK if your program takes hours to run.
Hint: Make sure you leave plenty of time for part j in the data cleaning and feature engineering section, as it is the most crucial part to processing the data.
Code this: To be able to get the data and process it, we first need to import several libraries. If you haven't installed these libraries, install them first, and then run these lines. Include these lines under part a in your submission.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
sns.set()
import math
Code this: Included with the assignment description is a file named "data.csv". Put this file somewhere your script/notebook can access and then import it using pandas (put import code under part b of submission), storing it in a variable named "data_raw". If you see a warning about mixed types, don't worry about it.
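If you're unsure what the import call looks like, here is a minimal, self-contained sketch. The StringIO stand-in (with made-up columns) replaces the real file path so the snippet runs on its own; in your submission you would pass the path to "data.csv" instead:

```python
import io
import pandas as pd

# Stand-in CSV so this sketch is self-contained; in the assignment you would
# pass the path to the provided "data.csv" instead of a StringIO object.
csv_text = "GameId,PlayId,Team,X,Y\n2017090700,20170907000118,away,73.91,34.84\n"

# low_memory=False reads the whole file in one pass, which avoids the
# "mixed types" DtypeWarning the assignment mentions.
data_raw = pd.read_csv(io.StringIO(csv_text), low_memory=False)
print(data_raw.shape)  # (1, 5)
```

Passing `low_memory=False` is optional; the warning is harmless either way, as noted above.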
Dataset information: This dataset contains data about real NFL games, specifically plays where the offensive team rushes (no passing of the ball: one player, the rusher, takes the ball and runs forward).
Here's the wikipedia link for American football if you're not familiar: https://en.wikipedia.org/wiki/American_football
And here's a link to a description of what "rushing" means: https://www.sportingcharts.com/dictionary/nfl/rush.aspx
In this dataset, each row corresponds to a single player's information (defense team or offense team) in a given rushing play (there are many plays per game). In a later assignment we will predict how many yards each rushing play gains (yards lost are negative values), which means that each observation for which a label is given is actually spread across 22 rows (there are 22 players involved in each play). Your task is to clean and tidy this dataset: process it so that the values are consistent, the mistakes are removed, and each observation corresponds to a single row (each rushing play on one row that contains data about all 22 players). This will make more sense once we start doing EDA.
Here's a description of what each column means in the dataset (remember they specify information about a single player):
Above is a figure of a typical NFL field. The specific values may not all line up with what's in the data.
As practitioners of ML, we don't need to analyze data as extensively as statisticians do, but some data analysis is necessary to make better predictions. Instead of calling it analysis, we will call it exploratory analysis so as not to anger the statisticians.
The goal of exploratory analysis is to get comfortable with the dataset. Nothing fancy. By the end of EDA you should be able to confidently present this dataset and talk about what's in it.
We'll start by taking a brief look at what a single data point (a single play) looks like. As mentioned above, each data point is spread across 22 rows, each with information about a specific player on the field. We'll use the play with PlayId == 20181115001638 as an example:
Code this: Extract all rows where PlayId == 20181115001638, check to make sure it's 22 rows.
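One way to pull out a single play is boolean-mask selection, sketched here on a toy frame with made-up values (on the real data you would compare the length against 22, and use the real PlayId):

```python
import pandas as pd

# Toy frame with two "plays" of 2 rows each; the real data has 22 rows per play.
df = pd.DataFrame({
    "PlayId": [1, 1, 2, 2],
    "NflId": [10, 11, 10, 11],
})

play = df[df["PlayId"] == 1]   # boolean-mask selection of one play
assert len(play) == 2          # on the real data, check for 22 instead
```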
Respond in comments: Take a look at this subset. Which columns (give 3 examples) have the same values for all 22 rows and correspond to information about the entire play? Which columns (3 examples) correspond to information about the specific player?
We've looked at the data in table form, but can we plot the positions of each player on the field to get a bird's eye view of the play?
This is mainly tedious work with matplotlib, feel free to write your own code, but below are a couple helpful plotting functions originally by Rob Mulla on Kaggle. They plot each player on the field, and, for the rusher, also plot the direction he is moving in (large blue arrow) as well as where his head is looking (small yellow arrow).
def create_football_field(linenumbers=True,
                          endzones=True,
                          highlight_line=False,
                          highlight_line_number=50,
                          highlighted_name='Line of Scrimmage',
                          fifty_is_los=False,
                          figsize=(12*2, 6.33*2)):
    rect = patches.Rectangle((0, 0), 120, 53.3, linewidth=0.1,
                             edgecolor='r', facecolor='darkgreen', zorder=0)
    fig, ax = plt.subplots(1, figsize=figsize)
    ax.add_patch(rect)
    plt.plot([10, 10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80,
              80, 90, 90, 100, 100, 110, 110, 120, 0, 0, 120, 120],
             [0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3,
              53.3, 0, 0, 53.3, 53.3, 0, 0, 53.3, 53.3, 53.3, 0, 0, 53.3],
             color='white')
    if fifty_is_los:
        plt.plot([60, 60], [0, 53.3], color='gold')
        plt.text(62, 50, '<- Player Yardline at Snap', color='gold')
    # Endzones
    if endzones:
        ez1 = patches.Rectangle((0, 0), 10, 53.3,
                                linewidth=0.1,
                                edgecolor='r',
                                facecolor='blue',
                                alpha=0.2,
                                zorder=0)
        ez2 = patches.Rectangle((110, 0), 120, 53.3,
                                linewidth=0.1,
                                edgecolor='r',
                                facecolor='blue',
                                alpha=0.2,
                                zorder=0)
        ax.add_patch(ez1)
        ax.add_patch(ez2)
    plt.xlim(0, 120)
    plt.ylim(-5, 58.3)
    plt.axis('off')
    if linenumbers:
        for x in range(20, 110, 10):
            numb = x
            if x > 50:
                numb = 120 - x
            plt.text(x, 5, str(numb - 10),
                     horizontalalignment='center',
                     fontsize=20,  # fontname='Arial',
                     color='white')
            plt.text(x - 0.95, 53.3 - 5, str(numb - 10),
                     horizontalalignment='center',
                     fontsize=20,  # fontname='Arial',
                     color='white', rotation=180)
    if endzones:
        hash_range = range(11, 110)
    else:
        hash_range = range(1, 120)
    for x in hash_range:
        ax.plot([x, x], [0.4, 0.7], color='white')
        ax.plot([x, x], [53.0, 52.5], color='white')
        ax.plot([x, x], [22.91, 23.57], color='white')
        ax.plot([x, x], [29.73, 30.39], color='white')
    if highlight_line:
        hl = highlight_line_number + 10
        plt.plot([hl, hl], [0, 53.3], color='yellow')
        plt.text(hl + 2, 50, '<- {}'.format(highlighted_name),
                 color='yellow')
    return fig, ax
def plot_play(play_id, train_df=data_raw):
    def get_dx_dy(angle, dist):
        cartesianAngleRadians = (450 - angle) * math.pi / 180.0
        dx = dist * math.cos(cartesianAngleRadians)
        dy = dist * math.sin(cartesianAngleRadians)
        return dx, dy
    fig, ax = create_football_field()
    train_df.query("PlayId == @play_id and Team == 'away'") \
        .plot(x='X', y='Y', kind='scatter', ax=ax, color='orange', s=50, legend='Away')
    train_df.query("PlayId == @play_id and Team == 'home'") \
        .plot(x='X', y='Y', kind='scatter', ax=ax, color='blue', s=50, legend='Home')
    train_df.query("PlayId == @play_id and NflIdRusher == NflId") \
        .plot(x='X', y='Y', kind='scatter', ax=ax, color='red', s=100, legend='Rusher')
    rusher_row = train_df.query("PlayId == @play_id and NflIdRusher == NflId")
    yards_covered = rusher_row["Yards"].values[0]
    x = rusher_row["X"].values[0]
    y = rusher_row["Y"].values[0]
    rusher_dir = rusher_row["Dir"].values[0]
    rusher_orientation = rusher_row["Orientation"].values[0]
    rusher_speed = rusher_row["S"].values[0]
    dx, dy = get_dx_dy(rusher_dir, rusher_speed)
    dx_o, dy_o = get_dx_dy(rusher_orientation, rusher_speed / 2)
    ax.arrow(x, y, dx, dy, length_includes_head=True, width=0.3)
    ax.arrow(x, y, dx_o, dy_o, length_includes_head=True, width=0.3, color="yellow")
    plt.title(f'Play # {play_id} and yard gain is {yards_covered}', fontsize=20)
    plt.show()
# how to use these two functions (one calls the other)
plot_play(play_id=20181115001638, train_df=data_raw)
Code this: Copy the functions and usage example above into your script/notebook, and take a look at the play with ID 20181115001638. Note down your observations in comments: Which direction (left or right) is the offense team going? Is the rusher looking where he is running? How are the defense and offense teams distributed on the field?
Code this: Using these functions, explore several more plays and look at the yard gain specified in the plot title. Make sure to cover both 2017 and 2018 plays. Apart from gaining more comfort with the data and intuition about which kinds of plays have large yard gains, pay specific attention to the Orientation and Dir of players in 2017. In the comments, note down any inconsistencies you find (it turns out this is indeed a mistake in the data that you'll have to correct later).
Let's reflect on what you just did: finding messiness in the dataset and noting it down to be corrected when we clean the data. Imagine if you hadn't noticed this inconsistency... the model could not make efficient use of player orientation at all! Even worse, since this awkwardness would allow the model to distinguish between 2017 plays and 2018 plays, it could lead to nasty overfitting!
Now that we have a rough sense of what each play looks like, we can proceed to the next step of EDA: univariate analysis. In univariate analysis, we look at each variable individually and observe how they are distributed, noting down observations.
Let's look at a position-related variable first, see if you can interpret it in the context of the field plots you made above.
Code this: Using either seaborn or pyplot, plot and describe (roughly) the distribution of player y-coordinates for ALL plays. Is this what you would expect? Why or why not?
Now it's time to look at our future prediction target! We want to predict the yard gain ("Yards" column) for each play, so it's extremely important that we understand how it is distributed.
Code this: Plot the distribution of Yards for all plays and describe what you see. Try a variety of plots like distplot, boxplot etc. to truly get a sense of how this column works. Also try out the .describe() function and use it to answer these questions: how do the mean and median yard gain compare? What's the max? What's the greatest yard loss in this data?
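As a sketch of what `.describe()` gives you, here is the call on a small made-up series of yard gains (the numbers are invented, not from the dataset):

```python
import pandas as pd

yards = pd.Series([-3, 0, 1, 2, 2, 3, 5, 8, 15, 60])  # made-up yard gains
stats = yards.describe()

# .describe() reports count, mean, std, min, quartiles (25%/50%/75%), and max.
print(stats["mean"], stats["50%"], stats["max"], stats["min"])
# 9.3 2.5 60.0 -3.0
```

Note how even in this toy series the long right tail pulls the mean well above the median; check whether the real Yards column behaves similarly.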
Apart from continuous columns, we can also look at discrete columns:
Code this: Look at and describe the column StadiumType. What is the most common stadium? Also notice the messy spelling...
The next step in EDA would be to look at the interactions between two or more variables, specifically the interactions between each variable and our target: Yards.
Since most of the relevant information will be derived from the processed data, we won't spend too much time here.
Code this: Make a plot of Humidity and Yards. What do you observe? Do you think humidity impacts yard gain meaningfully?
Code this: Do the same thing for StadiumType... does it really have a big impact on Yards? (If your graph is too messy just pick a subsample of the data to graph on)
The variation you observed is mostly due to each category having differing numbers of values (more outdoors than indoors). We could run tests to see if the differences are statistically significant, but I will cheat and tell you they're not. We'll drop these columns in the next section.
The NFLRush dataset is rather messy... time to clean it!
For this section we'll put all our data cleaning code in a function called clean_data that takes in the raw data and cleans it.
Code this: Start this function as follows, and for each part below mark clearly in your submission which section of the function does it:
def clean_data(data):
    data = data.copy()  # DataFrames are objects; we don't want to mess with what was passed in
    # a)
    # your code
    # b)
    # your code
    # etc.
First things first, let's drop some useless columns! EDA tells us a lot about which columns are useless, but the best way to find out is to run a model and check which columns help the least. For the sake of this assignment, below are some columns that are more or less useless for predictions:
["Stadium", "Location", "StadiumType", "Turf", "GameWeather", "Temperature", "Humidity", "WindDirection", "WindSpeed", "FieldPosition", "Week"]
Code this: Drop these columns.
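A minimal sketch of the drop, on a toy frame containing a couple of the doomed columns plus one keeper:

```python
import pandas as pd

to_drop = ["Stadium", "Location", "StadiumType", "Turf", "GameWeather",
           "Temperature", "Humidity", "WindDirection", "WindSpeed",
           "FieldPosition", "Week"]

# Toy frame; the real call is identical on data inside clean_data.
df = pd.DataFrame({"Stadium": ["A"], "Humidity": [55], "Yards": [4]})

# errors="ignore" lets the same call work even if a column is already gone.
df = df.drop(columns=to_drop, errors="ignore")
print(list(df.columns))  # ['Yards']
```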
Code this: Look at the columns PossessionTeam, HomeTeamAbbr, and VisitorTeamAbbr. There are a few inconsistencies in spelling of team abbreviations. Correct these, use google if you have to look up what abbreviation belongs to what team.
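One convenient pattern is a replacement dictionary passed to `.replace()`. The specific mappings below are illustrative examples; verify the actual mismatches in your own data before hard-coding anything:

```python
import pandas as pd

# Example mapping; confirm against the abbreviations you actually find.
fixes = {"ARZ": "ARI", "BLT": "BAL", "CLV": "CLE", "HST": "HOU"}

df = pd.DataFrame({"PossessionTeam": ["ARZ", "NE", "HST"]})  # toy column
df["PossessionTeam"] = df["PossessionTeam"].replace(fixes)
print(df["PossessionTeam"].tolist())  # ['ARI', 'NE', 'HOU']
```

Comparing `set(data.PossessionTeam)` against `set(data.HomeTeamAbbr) | set(data.VisitorTeamAbbr)` is one way to surface the mismatches.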
Code this: To aid further cleaning, we need a column that has value 1 whenever the player in the row is the rusher for the play he is in (his NflId==NflIdRusher), and has value 0 otherwise. Make this column and call it "IsRusher". Check to make sure there is only one row for which this value is true in every single play.
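A sketch of the flag column and the per-play sanity check, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "PlayId": [1, 1, 2, 2],
    "NflId": [10, 11, 10, 11],
    "NflIdRusher": [10, 10, 11, 11],
})

# 1 where the row's player is the rusher for that play, 0 otherwise.
df["IsRusher"] = (df["NflId"] == df["NflIdRusher"]).astype(int)

# Sanity check: exactly one rusher per play.
assert (df.groupby("PlayId")["IsRusher"].sum() == 1).all()
```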
Before the play even happens, each side already has a score from previous plays, as noted in "VisitorScoreBeforePlay" and "HomeScoreBeforePlay". However, left as they are they might force the model to be more complex than it needs to be. It makes much more sense to frame these as "OffensiveScore" and "DefensiveScore" based on the columns "PossessionTeam" (the team with possession is the offensive team) and "HomeTeamAbbr/VisitorTeamAbbr".
Code this: Make two new columns: "OffensiveScore" and "DefensiveScore". Store in these columns the scores from the offensive team and defensive team as given in "VisitorScoreBeforePlay" and "HomeScoreBeforePlay" and adjusted for home and visitor.
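One way to do the home/visitor swap is `np.where` on a "home team has the ball" mask, sketched here on toy scores:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PossessionTeam": ["NE", "KC"],
    "HomeTeamAbbr": ["NE", "DEN"],
    "HomeScoreBeforePlay": [14, 3],
    "VisitorScoreBeforePlay": [7, 10],
})

home_has_ball = df["PossessionTeam"] == df["HomeTeamAbbr"]
df["OffensiveScore"] = np.where(home_has_ball,
                                df["HomeScoreBeforePlay"],
                                df["VisitorScoreBeforePlay"])
df["DefensiveScore"] = np.where(home_has_ball,
                                df["VisitorScoreBeforePlay"],
                                df["HomeScoreBeforePlay"])
print(df[["OffensiveScore", "DefensiveScore"]].values.tolist())
# [[14, 7], [10, 3]]
```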
Notice some plays go to the left while others go to the right. This can also confuse our model needlessly. Let's flip our entire field (player positions, orientations etc.) for all those plays that go to the left, so that every play in the data moves from left to right.
Code this: There is a lot of tedious, error-prone geometry work here, so for this part just copy-paste the code below into your function. It is originally from a kernel by CPMP on Kaggle.
data["ToLeft"] = data.PlayDirection == "left"
data["Dir_rad"] = np.mod(90 - data.Dir, 360) * math.pi/180.0
data["Orientation_rad"] = np.mod(90 - data.Orientation, 360) * math.pi/180.0
data["TeamOnOffense"] = "home"
data.loc[data.PossessionTeam != data.HomeTeamAbbr, "TeamOnOffense"] = "away"
data["IsOnOffense"] = data.Team == data.TeamOnOffense
data["YardLine_std"] = 100 - data.YardLine
data.loc[data.FieldPosition.fillna("") == data.PossessionTeam, "YardLine_std"] = data.loc[data.FieldPosition.fillna("") == data.PossessionTeam, "YardLine"]
data["YardLine"] = data.YardLine_std
data["X_std"] = data.X
data.loc[data.ToLeft, "X_std"] = 120 - data.loc[data.ToLeft, "X"]
data["X"] = data.X_std - data.YardLine - 10
data["Y_std"] = data.Y
data.loc[data.ToLeft, "Y_std"] = 160/3 - data.loc[data.ToLeft, "Y"]
data["Y"] = data.Y_std
data["Dir_std"] = data.Dir_rad
data["Orientation_std"] = data.Orientation_rad
data.loc[data.ToLeft, "Dir_std"] = np.mod(np.pi + data.loc[data.ToLeft, "Dir_rad"], 2*np.pi)
data.loc[data.ToLeft, "Orientation_std"] = np.mod(np.pi + data.loc[data.ToLeft, "Orientation_rad"], 2*np.pi)
data["Dir"] = data.Dir_std
data["Orientation"] = data.Orientation_std
Now it's time to address the problem you noticed in part c of the EDA. As mentioned before this could be a pretty big problem if it is not corrected.
Code this: Correct the orientation values for the 2017 season according to your observations in EDA part c. Store the corrected values into the same column as the original ones. Make sure to add $2\pi$ to any values that went below 0.
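Without giving away what the correction should be, here is the general pattern for adjusting an angle column for one season only, on a toy frame. The offset below is a placeholder, not the answer; substitute whatever your EDA suggests:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Season": [2017, 2018],
    "Orientation": [0.5, 1.0],  # radians, made-up values
})

# Placeholder offset -- replace with the value your EDA observations imply.
offset = np.pi / 2
mask = df["Season"] == 2017
# np.mod wraps the result back into [0, 2*pi), which also takes care of
# adding 2*pi to any values that would go below 0.
df.loc[mask, "Orientation"] = np.mod(df.loc[mask, "Orientation"] + offset,
                                     2 * np.pi)
```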
Now let's handle missing values, which will mess up our modelling. There are NA (one interpretation of this abbreviation is "Not Applicable") values in Dir, Orientation, DefendersInTheBox, and OffenseFormation.
In general, there are two ways of handling missing values:
We will take the second approach, as data is scarce.
Code this: Fill NA values in Dir with the mean value of the Dir column, "data["Dir"].mean()". Fill the NA's in DefendersInTheBox the same way. Fill the NA's in Orientation with the value of Dir in the same row (assuming the player looks where he goes). Fill the NA's in OffenseFormation with the string "unknown".
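A sketch of the fills on a toy frame. Note the order: fill Dir first, so that the Orientation fallback never copies a NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Dir": [1.0, np.nan, 3.0],
    "Orientation": [np.nan, 2.0, np.nan],
    "OffenseFormation": ["SHOTGUN", None, "I_FORM"],
})

df["Dir"] = df["Dir"].fillna(df["Dir"].mean())           # mean of 1.0 and 3.0 is 2.0
df["Orientation"] = df["Orientation"].fillna(df["Dir"])  # assume the player looks where he runs
df["OffenseFormation"] = df["OffenseFormation"].fillna("unknown")
```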
For convenience in what is to come, we can create two columns called "PlayerId" and "UniquePlayId". They are meant to be IDs for each player/play, and consist of the string concatenation of a couple of other columns as shown below. UniquePlayId is added as a robustness feature (PlayId is already unique): if a problem ever arises in PlayId when we apply our preprocessing to real-world data, the pipeline can still handle it without error.
data["PlayerId"] = data["JerseyNumber"].astype(str) + "_" + data["NflId"].astype(str)
data["UniquePlayId"] = data["GameId"].astype(str) + "_" + data["PlayId"].astype(str)
Code this: Type the two lines above under part g in your function.
Code this: Great, we're getting close to the final step. At this point we should drop all the columns left over as trash from our above cleaning. However, instead of dropping them, it is easier to just take out what we need and leave the rest. Here's a list of all the columns we need to include at this point from the processing you did before, and you can simply select only these columns from the data.
["GameId", "PlayId", "X", "Y", "S", "A", "Dis", "Orientation", "Dir", "Season", "YardLine", "Quarter", "GameClock", "Down", "Distance", "OffenseFormation", "OffensePersonnel", "DefendersInTheBox", "DefensePersonnel", "TimeHandoff", "TimeSnap", "Yards", "PlayerHeight", "PlayerWeight", "PlayerCollegeName", "Position", "IsRusher", "OffensiveScore", "DefensiveScore", "IsOnOffense", "PlayerId", "UniquePlayId"]
Final step! Here we need to reshape our dataframe so that we get the information from all 22 players, spread across 22 rows, into one row. This is called tidying the data. There are many ways to do this; some require a few short lines and others many. Resist the temptation of for-looping through the whole thing, as it is both slow and a pain to implement. Also resist the temptation of using numpy-style reshaping, which will make sorting players within each play a pain.
Here are some specifications for the final dataset:
Code this: Tidy the data. If you're feeling stuck reach out for help.
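As a hint at one possible shape of the solution (not the only one), here is the long-to-wide pivot pattern on a toy frame with 2 plays of 2 players each; the real data has 23171 plays of 22 players, and you would first sort the rows within each play so the numbering is consistent (e.g. rusher first, or offense before defense):

```python
import pandas as pd

# Toy long frame: 2 plays x 2 players.
df = pd.DataFrame({
    "PlayId": [1, 1, 2, 2],
    "X": [10.0, 20.0, 30.0, 40.0],
    "Y": [1.0, 2.0, 3.0, 4.0],
})

# Number the players within each play.
df["PlayerNum"] = df.groupby("PlayId").cumcount()

# Pivot to one row per play; columns become a (variable, player) MultiIndex.
wide = df.set_index(["PlayId", "PlayerNum"]).unstack("PlayerNum")
# Flatten the MultiIndex into names like "0_X", "1_X", ...
wide.columns = [f"{num}_{col}" for col, num in wide.columns]
print(wide.shape)  # one row per play
```

Play-level columns (those identical across all 22 rows) should not be duplicated 22 times; keep one copy per play instead.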
Code this: Finally, return your data.
At this point, your processed data should look like this:
There should be 23171 rows since there are 23171 plays total.
Each row should have the following "meta" information about the play:
And the following information for EACH player, with the player numbering in front, of course:
Great job! You've gone through the entire process of data exploration, cleaning, and feature engineering. In real world ML projects these are crucial steps for success!
Of course, there is always more to do (there is no end to ML)! If you get extra time, try to compute the distances between every defender and the rusher, as well as the shortest possible time it might take for a defender to tackle the rusher. These are important features too!
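If you attempt the extra credit, the defender-to-rusher distances reduce to a vectorized Euclidean distance. A sketch on a toy play with made-up coordinates:

```python
import numpy as np
import pandas as pd

# Toy play: one rusher and two defenders with (X, Y) positions.
play = pd.DataFrame({
    "X": [50.0, 53.0, 50.0],
    "Y": [25.0, 29.0, 30.0],
    "IsRusher": [1, 0, 0],
    "IsOnOffense": [True, False, False],
})

rusher = play.loc[play["IsRusher"] == 1, ["X", "Y"]].iloc[0]
defenders = play.loc[~play["IsOnOffense"], ["X", "Y"]]

# Euclidean distance from each defender to the rusher, computed in one shot.
dists = np.hypot(defenders["X"] - rusher["X"], defenders["Y"] - rusher["Y"])
print(dists.tolist())  # [5.0, 5.0]
```

For the shortest-time feature you would additionally bring in each defender's speed (S) and direction (Dir); that part is left open here.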